Overview

Dataset Statistics

Number of Variables 10
Number of Rows 15472
Missing Cells 5044
Missing Cells (%) 3.3%
Duplicate Rows 137
Duplicate Rows (%) 0.9%
Total Size in Memory 8.2 MB
Average Row Size in Memory 558.5 B
Variable Types
  • Numerical: 4
  • Categorical: 6
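
The percentages above follow from the raw counts: missing cells are measured against all cells (rows × variables), while duplicate rows are measured against the row count. The sketch below reproduces that arithmetic in Python; the helper name `percent` is illustrative and not part of whatever tool produced this report.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

n_rows, n_vars = 15472, 10
missing_cells, duplicate_rows = 5044, 137

# Missing cells are a share of every cell in the table.
missing_pct = percent(missing_cells, n_rows * n_vars)
# Duplicate rows are a share of all rows.
duplicate_pct = percent(duplicate_rows, n_rows)
# All missing cells sit in tweet_location, so its own rate is per row.
column_missing_pct = percent(missing_cells, n_rows)

print(missing_pct, duplicate_pct, column_missing_pct)
```

This is why the same 5044 missing values appear as 3.3% at the dataset level but as 32.6% for tweet_location alone.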

Dataset Insights

tweet_location has 5044 (32.6%) missing values
tweet_id is skewed
tweets_len is skewed
polarity is skewed
username has a high cardinality: 8364 distinct values
text has a high cardinality: 15259 distinct values
tweet_location has a high cardinality: 3299 distinct values
text_tokenized has a high cardinality: 12682 distinct values
polarity has 3891 (25.15%) negatives
polarity has 5735 (37.07%) zeros

Variables


tweet_id

numerical

Approximate Distinct Count 2962
Approximate Unique (%) 19.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 241.8 KB
Mean 5.6053e+17
Minimum 6049
Maximum 1.51e+18
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • tweet_id is skewed right (γ1 = 0.2743)

Quantile Statistics

Minimum 6049
5-th Percentile 5.6774e+17
Q1 5.6849e+17
Median 5.694e+17
Q3 5.699e+17
95-th Percentile 5.7027e+17
Maximum 1.51e+18
Range 1.51e+18
IQR 1.407e+15

Descriptive Statistics

Mean 5.6053e+17
Standard Deviation 1.4148e+17
Variance 2.0018e+34
Sum 8.6725e+21
Skewness 0.2743
Kurtosis 20.5413
Coefficient of Variation 0.2524
  • tweet_id is not normally distributed (p-value 4.562e-25)
  • tweet_id has 832 outliers
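
Skewness (γ1) and kurtosis are standardized third and fourth central moments. The sketch below assumes population (biased) moment estimators and the excess-kurtosis convention (a normal distribution scores 0); the report does not state which estimator or convention it uses, so treat this as an illustration rather than the tool's exact method.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)^k."""
    mu = fmean(values)
    return fmean([(x - mu) ** k for x in values])

def skewness(values):
    """Third standardized moment; positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (assumed convention)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 50]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tail))
```

A small skewness combined with a large kurtosis, as reported for tweet_id, is consistent with extreme values on both sides of a tightly clustered bulk (here a minimum of 6049 and a maximum of 1.51e+18 around quartiles near 5.69e+17).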

username

categorical

Approximate Distinct Count 8364
Approximate Unique (%) 54.1%
Missing 0
Missing (%) 0.0%
Memory Size 1.1 MB
  • The largest value (JetBlueNews) is over 1.66 times larger than the second largest value (Flight_Refunds)

Length

Mean 10.5779
Standard Deviation 2.6281
Median 11
Minimum 2
Maximum 19

Sample

1st row cairdin
2nd row jnardino
3rd row yvonnalynn
4th row jnardino
5th row jnardino

Letter

Count 152805
Lowercase Letter 135102
Space Separator 7
Uppercase Letter 17703
Dash Punctuation 0
Decimal Number 8741
  • username contains many words: 8367 words
  • The largest value (jetbluenews) is over 1.66 times larger than the second largest value (flight_refunds)

text

categorical

Approximate Distinct Count 15259
Approximate Unique (%) 98.6%
Missing 0
Missing (%) 0.0%
Memory Size 2.7 MB

Length

Mean 106.69
Standard Deviation 40.8017
Median 115
Minimum 9
Maximum 472

Sample

1st row @VirginAmerica Wha...
2nd row @VirginAmerica plu...
3rd row @VirginAmerica I d...
4th row @VirginAmerica it'...
5th row @VirginAmerica and...

Letter

Count 1277859
Lowercase Letter 1191233
Space Separator 268132
Uppercase Letter 86626
Dash Punctuation 2071
Decimal Number 19847
  • text contains many words: 17436 words
  • The largest value (i) is over 1.51 times larger than the second largest value (united)

airline_name

categorical

Approximate Distinct Count 16
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Memory Size 1.1 MB

Length

Mean 7.8239
Standard Deviation 2.2268
Median 8
Minimum 5
Maximum 15

Sample

1st row Virgin America
2nd row Virgin America
3rd row Virgin America
4th row Virgin America
5th row Virgin America

Letter

Count 117530
Lowercase Letter 95674
Space Separator 3417
Uppercase Letter 21856
Dash Punctuation 0
Decimal Number 0

airline_sentiment

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 1.1 MB
  • The largest value (negative) is over 2.99 times larger than the second largest value (neutral)

Length

Mean 7.7898
Standard Deviation 0.4075
Median 8
Minimum 7
Maximum 8

Sample

1st row neutral
2nd row positive
3rd row neutral
4th row negative
5th row negative

Letter

Count 120524
Lowercase Letter 120524
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (negative, neutral) take over 50.0%
  • The largest value (negative) is over 2.99 times larger than the second largest value (neutral)

tweet_location

categorical

Approximate Distinct Count 3299
Approximate Unique (%) 31.6%
Missing 5044
Missing (%) 32.6%
Memory Size 815.5 KB

Length

Mean 13.5077
Standard Deviation 6.482
Median 13
Minimum 1
Maximum 53

Sample

1st row Lets Play
2nd row San Francisco CA
3rd row Los Angeles
4th row San Diego
5th row Los Angeles

Letter

Count 113881
Lowercase Letter 86446
Space Separator 13543
Uppercase Letter 27435
Dash Punctuation 423
Decimal Number 3753
  • tweet_location contains many words: 2672 words
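
The Letter blocks in this report read like counts of Unicode general categories. The sketch below tallies them with Python's `unicodedata` module, under two assumptions: that the labels map to the categories `Ll`, `Lu`, `Zs`, `Pd` and `Nd`, and that "Count" is the sum of lowercase and uppercase letters (which holds for every Letter block shown here). The function name `letter_stats` is illustrative.

```python
import unicodedata
from collections import Counter

# Assumed mapping from Unicode general category to the report's labels.
CATEGORY_LABELS = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Zs": "Space Separator",
    "Pd": "Dash Punctuation",
    "Nd": "Decimal Number",
}

def letter_stats(values):
    """Count characters per tracked Unicode category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            label = CATEGORY_LABELS.get(unicodedata.category(ch))
            if label:
                counts[label] += 1
    stats = {label: counts[label] for label in CATEGORY_LABELS.values()}
    # Assumption: "Count" is the number of letters (lowercase + uppercase).
    stats["Count"] = stats["Lowercase Letter"] + stats["Uppercase Letter"]
    return stats

# The five tweet_location sample rows from the report.
sample_rows = ["Lets Play", "San Francisco CA", "Los Angeles",
               "San Diego", "Los Angeles"]
stats = letter_stats(sample_rows)
print(stats)
```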

text_tokenized

categorical

Approximate Distinct Count 12682
Approximate Unique (%) 82.0%
Missing 0
Missing (%) 0.0%
Memory Size 1.5 MB
  • The largest value ([]) is over 13.59 times larger than the second largest value (['fleet', 'fleek'])

Length

Mean 38.3303
Standard Deviation 25.9966
Median 35
Minimum 2
Maximum 236

Sample

1st row []
2nd row ['add', 'commercia...
3rd row ['another']
4th row ['aggressive', 'bl...
5th row []

Letter

Count 364449
Lowercase Letter 364449
Space Separator 41686
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 2548
  • text_tokenized contains many words: 8072 words

tweets_len

numerical

Approximate Distinct Count 274
Approximate Unique (%) 1.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 241.8 KB
Mean 106.69
Minimum 9
Maximum 472
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • tweets_len is skewed right (γ1 = 0.128)

Quantile Statistics

Minimum 9
5-th Percentile 34
Q1 78
Median 115
Q3 137
95-th Percentile 149
Maximum 472
Range 463
IQR 59

Descriptive Statistics

Mean 106.69
Standard Deviation 40.8017
Variance 1664.7747
Sum 1.6507e+06
Skewness 0.128
Kurtosis 1.4998
Coefficient of Variation 0.3824
  • tweets_len is not normally distributed (p-value 5.706e-10)
  • tweets_len has 171 outliers
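
The report does not say how it flags outliers. A common choice is Tukey's rule, which marks values more than 1.5 × IQR beyond the quartiles; the sketch below applies that rule to the tweets_len quartiles as an assumption, not as the tool's documented method.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for tweets_len.
lower, upper = tukey_fences(78, 137)
print(lower, upper)
```

With a minimum of 9, no value can fall below the lower fence, so under this rule every flagged tweet would be an unusually long one.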

tweets_word_count

numerical

Approximate Distinct Count 60
Approximate Unique (%) 0.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 241.8 KB
Mean 18.2009
Minimum 1
Maximum 61
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • tweets_word_count is skewed right (γ1 = 0.3557)

Quantile Statistics

Minimum 1
5-th Percentile 5
Q1 13
Median 19
Q3 23
95-th Percentile 28
Maximum 61
Range 60
IQR 10

Descriptive Statistics

Mean 18.2009
Standard Deviation 7.776
Variance 60.4661
Sum 281604
Skewness 0.3557
Kurtosis 1.3966
Coefficient of Variation 0.4272
  • tweets_word_count is not normally distributed (p-value 2.740e-04)
  • tweets_word_count has 221 outliers

polarity

numerical

Approximate Distinct Count 1644
Approximate Unique (%) 10.6%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 241.8 KB
Mean 0.04946
Minimum -1
Maximum 1
Zeros 5735
Zeros (%) 37.1%
Negatives 3891
Negatives (%) 25.1%
  • polarity is skewed left (γ1 = -0.0398)

Quantile Statistics

Minimum -1
5-th Percentile -0.5
Q1 -0.01205
Median 0
Q3 0.2
95-th Percentile 0.625
Maximum 1
Range 2
IQR 0.2121

Descriptive Statistics

Mean 0.04946
Standard Deviation 0.3257
Variance 0.1061
Sum 765.2109
Skewness -0.03981
Kurtosis 1.8307
Coefficient of Variation 6.5848
  • polarity is not normally distributed (p-value 4.101e-24)
  • polarity has 2560 outliers
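
Because polarity has no missing or infinite values, the count of positive scores is whatever remains after zeros and negatives. The sketch below derives that breakdown and recomputes the coefficient of variation (standard deviation divided by mean) from the rounded figures above; the variable names are illustrative.

```python
n_rows = 15472
zeros, negatives = 5735, 3891

# With no missing or infinite values, the remainder must be positive.
positives = n_rows - zeros - negatives

shares = {
    "negative": round(100 * negatives / n_rows, 2),
    "zero": round(100 * zeros / n_rows, 2),
    "positive": round(100 * positives / n_rows, 2),
}

# Coefficient of variation from the rounded mean and standard deviation.
cv = 0.3257 / 0.04946

print(positives, shares, round(cv, 2))
```

The large coefficient of variation mainly reflects a mean close to zero rather than unusually wide dispersion.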

Interactions

Correlations

Missing Values